This notebook performs an exploratory data analysis of three world development indicators: access to electricity, hospital beds, and gross domestic product (GDP). The data was retrieved from the World Bank's DataBank: World Development Indicators.

# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sbn
import plotly.graph_objects as go
import plotly.express as px
The results of the preprocessing notebook are reimported here.
gdp_df = pd.read_csv("data/gdp_one_df.csv", low_memory=False)
hbd_df = pd.read_csv("data/hbd_one_df.csv", low_memory=False)
ate_df = pd.read_csv("data/ate_one_df.csv", low_memory=False)
# Combine the three previously loaded dataframes into a single dataframe.
all_df = gdp_df.copy(deep=True)
all_df["HBD"] = hbd_df["HBD"]
all_df["ATE"] = ate_df["ATE"]
all_df.tail(2)
| | CountryName | CountryCode | Year | GDP | HBD | ATE |
|---|---|---|---|---|---|---|
| 1138 | Vanuatu | VUT | 2021 | 9.834693e+08 | 0.0 | 0.0 |
| 1139 | Vietnam | VNM | 2021 | 3.626375e+11 | 0.0 | 0.0 |
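Note that assigning `hbd_df["HBD"]` and `ate_df["ATE"]` by position assumes all three files share exactly the same row order. A minimal sketch (with hypothetical two-row frames standing in for the real CSVs) of a safer alternative that joins on the key columns instead:

```python
import pandas as pd

# Hypothetical mini versions of the per-indicator frames; note hbd's rows
# are in a different order, which positional assignment would scramble.
gdp = pd.DataFrame({"CountryName": ["A", "B"], "CountryCode": ["AAA", "BBB"],
                    "Year": [2021, 2021], "GDP": [1.0, 2.0]})
hbd = pd.DataFrame({"CountryName": ["B", "A"], "CountryCode": ["BBB", "AAA"],
                    "Year": [2021, 2021], "HBD": [3.0, 4.0]})

# Merging on CountryCode and Year aligns rows by key, not by position.
merged = gdp.merge(hbd[["CountryCode", "Year", "HBD"]],
                   on=["CountryCode", "Year"], how="left")
print(merged.loc[merged["CountryName"] == "A", "HBD"].item())  # 4.0
```

With a key-based merge, the result is correct even if the preprocessing step ever changes the row ordering of one file.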
# Export merged data for backup
all_df.to_csv("data/all_df.csv", index=False)
# Create a list of all countries
list_countries = all_df['CountryName'].unique().tolist()
# Create a list of all the years in the data set
list_years = all_df['Year'].unique().tolist()
Let's take a quick look at our data using these interactive plots!
pal = list(sbn.color_palette(palette='viridis', n_colors=len(list_countries)).as_hex())
fig = go.Figure()
for country, p in zip(list_countries, pal):
    # Filter on all_df's own CountryName column (not gdp_df's)
    fig.add_trace(go.Scatter(x=list_years, y=all_df[all_df['CountryName'] == country]['GDP'],
                             name=country, line_color=p, fill=None))
fig.show()
pal = list(sbn.color_palette(palette='viridis', n_colors=len(list_countries)).as_hex())
fig = go.Figure()
for country, p in zip(list_countries, pal):
    fig.add_trace(go.Scatter(x=list_years, y=all_df[all_df['CountryName'] == country]['HBD'],
                             name=country, line_color=p, fill=None))
fig.show()
pal = list(sbn.color_palette(palette='viridis', n_colors=len(list_countries)).as_hex())
fig = go.Figure()
for country, p in zip(list_countries, pal):
    fig.add_trace(go.Scatter(x=list_years, y=all_df[all_df['CountryName'] == country]['ATE'],
                             name=country, line_color=p, fill=None))
fig.show()
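The three plotting loops above differ only in which indicator column they select, so the per-country filtering can be factored into one helper. A minimal sketch of that selection logic (the `demo` frame is a hypothetical stand-in for `all_df`; the returned pair is what would feed `go.Scatter`'s `x` and `y`):

```python
import pandas as pd

def country_series(df, country, column):
    """Return (years, values) for one country and one indicator column."""
    sub = df[df["CountryName"] == country]
    return sub["Year"].tolist(), sub[column].tolist()

# Tiny illustrative frame with the same columns the notebook's all_df uses.
demo = pd.DataFrame({"CountryName": ["A", "A", "B"],
                     "Year": [2020, 2021, 2020],
                     "GDP": [1.0, 2.0, 3.0]})
print(country_series(demo, "A", "GDP"))  # ([2020, 2021], [1.0, 2.0])
```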
Describing the data gives us an idea of how it is distributed; the next few outputs will help us understand it.
all_df.describe()
| | Year | GDP | HBD | ATE |
|---|---|---|---|---|
| count | 1140.000000 | 1.140000e+03 | 1140.00000 | 1140.000000 |
| mean | 2015.500000 | 5.044162e+11 | 2.96679 | 87.553240 |
| std | 3.453568 | 1.431225e+12 | 2.60405 | 29.149339 |
| min | 2010.000000 | 3.210420e+07 | 0.00000 | 0.000000 |
| 25% | 2012.750000 | 7.447009e+09 | 0.82250 | 97.897501 |
| 50% | 2015.500000 | 5.621421e+10 | 2.67000 | 100.000000 |
| 75% | 2018.250000 | 4.078550e+11 | 3.86750 | 100.000000 |
| max | 2021.000000 | 1.773406e+13 | 16.46000 | 100.000000 |
all_df.dtypes
CountryName     object
CountryCode     object
Year             int64
GDP            float64
HBD            float64
ATE            float64
dtype: object
# Checking the distribution of our data using histograms
all_df.hist(bins=5, figsize=(20, 10));
# Checking for correlation between our numeric columns
fig = px.imshow(all_df.corr(numeric_only=True))
fig.show()
# Another way of checking relationships between our columns
sbn.pairplot(all_df)
plt.show()
# Checking column relationships per country
# This may be a little hard to interpret due to the many countries listed;
# we can re-implement this for each country for a clearer analysis.
sbn.pairplot(all_df, hue='CountryName')
plt.show()
There are a few outliers in our data, but removing them may do more harm than good: this is a time series dataset with a yearly grain, so the data is already very limited.
all_df.plot(kind='box', subplots=True, layout=(2,7),sharex=False,sharey=False, figsize=(20, 10), color='darkorange');
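To quantify rather than just eyeball the outliers, the box plot's own whisker rule (points beyond 1.5 × IQR from the quartiles) can be counted per column. A minimal sketch on a hypothetical series; applying `iqr_outlier_count` to each numeric column of `all_df` would give the counts behind the plots above:

```python
import pandas as pd

def iqr_outlier_count(s):
    """Count points outside the 1.5*IQR whiskers (the box-plot rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lo) | (s > hi)).sum())

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier
print(iqr_outlier_count(s))  # 1
```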
In the Excel sheet, we tried to create a per-country GDP forecast; let's attempt that here using all of our data and see whether we can train an accurate model to predict GDP.
# Let's start by creating a copy of our dataset
pred_df = all_df.copy(deep=True)
pred_df.drop(columns=["CountryCode"], inplace=True)
pred_df.tail(5).T
| | 1135 | 1136 | 1137 | 1138 | 1139 |
|---|---|---|---|---|---|
| CountryName | Ukraine | United Kingdom | Uzbekistan | Vanuatu | Vietnam |
| Year | 2021 | 2021 | 2021 | 2021 | 2021 |
| GDP | 200085537744.354004 | 3186859739185.02002 | 69238903106.173798 | 983469256.849629 | 362637524070.968994 |
| HBD | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ATE | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
pred_df.CountryName.value_counts()
Albania 12
Netherlands 12
Poland 12
Philippines 12
Papua New Guinea 12
..
Greenland 12
Greece 12
Gibraltar 12
Germany 12
Vietnam 12
Name: CountryName, Length: 95, dtype: int64
# Turn categorical variables into numbers or categorical codes
for label, content in pred_df.items():
    if pd.api.types.is_string_dtype(content):
        # Turn categories into integer codes
        pred_df[label] = pd.Categorical(content).codes
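`pd.Categorical(...).codes` assigns each distinct value an integer based on the sorted category order, so the same country name always maps to the same code within this frame. A minimal sketch with hypothetical country names:

```python
import pandas as pd

names = pd.Series(["Albania", "Vietnam", "Albania"])
cat = pd.Categorical(names)
# Categories are sorted, so Albania -> 0, Vietnam -> 1
print(list(cat.codes))       # [0, 1, 0]
print(list(cat.categories))  # ['Albania', 'Vietnam']
```

One caveat of this encoding: the codes impose an arbitrary ordering on countries, which tree-based models tolerate better than linear ones.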
# Let us create a sample to test if we can fit our data into a model
X_train, y_train = pred_df.drop("GDP", axis=1), pred_df.GDP
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs=-1, random_state=42, n_estimators = 100, max_samples=100)
model.fit(X_train, y_train)
RandomForestRegressor(max_samples=100, n_jobs=-1, random_state=42)
# Create training and validation dataset
df_val = pred_df[all_df.Year == 2021]
df_train = pred_df[all_df.Year != 2021]
len(df_val), len(df_train)
(95, 1045)
# Split data into X & y
X_train, y_train = df_train.drop("GDP", axis=1), df_train.GDP
X_valid, y_valid = df_val.drop("GDP", axis=1), df_val.GDP
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape
((1045, 4), (1045,), (95, 4), (95,))
# Create function to evaluate our model
from sklearn.metrics import mean_squared_error, mean_absolute_error
model_accuracies = []
def show_scores(estimator, model):
    """
    Displays the MAE, RMSE and R-squared of both training and validation.
    Also appends each result to the model_accuracies list for comparison.
    """
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)
    scores = {"Estimator": estimator,
              "Training MAE": mean_absolute_error(y_train, train_preds),
              "Valid MAE": mean_absolute_error(y_valid, val_preds),
              "Training RMSE": np.sqrt(mean_squared_error(y_train, train_preds)),
              "Valid RMSE": np.sqrt(mean_squared_error(y_valid, val_preds)),
              "Training R^2": model.score(X_train, y_train),
              "Valid R^2": model.score(X_valid, y_valid)}
    model_accuracies.append(list(scores.values()))
    return scores
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# Create a dictionary containing multiple models/estimators
dict_regressors = {
"GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
"ElasticNet": ElasticNet(random_state=42),
"LinearBayesianRidge": linear_model.BayesianRidge(),
"LinearRegression": LinearRegression(),
"RandomForestRegressor": RandomForestRegressor(n_jobs=-1, random_state=42, max_samples=1000)}
# Re-initialize the model_accuracies list
model_accuracies = []
%%time
# Loop through the regression model dictionary, fit each model to the training data and score it.
for model_name, model_instantiation in dict_regressors.items():
    current_model = model_instantiation
    current_model.fit(X_train, y_train)
    show_scores(model_name, current_model)
CPU times: total: 1.91 s Wall time: 925 ms
# Convert the model_accuracies list into a DataFrame for better readability
model_accuracies_df = pd.DataFrame(model_accuracies,
                                   columns=['Estimator', 'Training MAE', 'Valid MAE',
                                            'Training RMSE', 'Valid RMSE',
                                            'Training R^2', 'Valid R^2'])
model_accuracies_df
| | Estimator | Training MAE | Valid MAE | Training RMSE | Valid RMSE | Training R^2 | Valid R^2 |
|---|---|---|---|---|---|---|---|
| 0 | GradientBoostingRegressor | 1.657705e+11 | 3.620254e+11 | 2.653431e+11 | 7.644505e+11 | 0.962662 | 0.845397 |
| 1 | ElasticNet | 6.225162e+11 | 8.745296e+11 | 1.335221e+12 | 2.122869e+12 | 0.054543 | -0.192245 |
| 2 | LinearBayesianRidge | 6.313179e+11 | 6.573992e+11 | 1.373195e+12 | 1.954571e+12 | 0.000000 | -0.010699 |
| 3 | LinearRegression | 6.231085e+11 | 8.332307e+11 | 1.334854e+12 | 2.105567e+12 | 0.055063 | -0.172889 |
| 4 | RandomForestRegressor | 4.777562e+10 | 4.537450e+11 | 1.336596e+11 | 1.018737e+12 | 0.990526 | 0.725436 |
model_accuracies_df.plot(title='Mean Absolute Error', x='Estimator', y=['Training MAE', 'Valid MAE'], figsize=(10,5), grid=True, kind='bar');
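Absolute errors in the hundreds of billions are hard to interpret on their own; expressing the validation MAE as a fraction of mean GDP gives a rough sense of scale. A quick check using the best model's figure from the table above and the mean from `all_df.describe()`:

```python
# Values taken from the outputs above
valid_mae = 3.620254e11  # GradientBoostingRegressor, Valid MAE
mean_gdp = 5.044162e11   # mean GDP from all_df.describe()

# The average validation error is roughly 72% of the mean GDP
print(round(valid_mae / mean_gdp, 2))  # 0.72
```

So even the best model's typical error is a large fraction of an average country's GDP, which matches the "our errors are high" observation below.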
Our errors are high, but let's take the least erroneous models and dissect how they analyzed our data by checking the predictor importances below.
dict_regressors["GradientBoostingRegressor"]
GradientBoostingRegressor(random_state=42)
# Get the model feature importance for the GradientBoostingRegressor
sbn.set(rc={"figure.figsize":(15,4)})
sbn.barplot(x= dict_regressors["GradientBoostingRegressor"].feature_importances_, y=X_train.columns);
# Get the model feature importance for the RandomForestRegressor
sbn.set(rc={"figure.figsize":(15,4)})
sbn.barplot(x= dict_regressors["RandomForestRegressor"].feature_importances_, y=X_train.columns);
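The bar plots show magnitudes, but pairing the importances with their column names in a labelled, sorted Series makes the ranking explicit. A minimal sketch with hypothetical importance values (the real ones come from `.feature_importances_` of the fitted estimators above):

```python
import pandas as pd

# Hypothetical importances for the four predictors; replace with the
# values from a fitted model's .feature_importances_ attribute.
importances = pd.Series([0.05, 0.10, 0.60, 0.25],
                        index=["CountryName", "Year", "HBD", "ATE"])
ranked = importances.sort_values(ascending=False)
print(ranked.index.tolist())  # ['HBD', 'ATE', 'Year', 'CountryName']
```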
All the steps above should have given us some familiarity with how our data behaves.